The Twofish Encryption Algorithm: A 128-Bit Block Cipher The Twofish Encryption Algorithm: A 128-Bit Block Cipher
by Bruce Schneier ; John Kelsey ; Doug Whiting ; David Wagner ; Chris Hall ; Niels Ferguson
Wiley Computer Publishing, John Wiley & Sons, Inc.
ISBN: 0471353817   Pub Date: 03/01/99
  

Previous Table of Contents Next


5.4 Performance on Future Microprocessors

Given the ever-advancing capabilities of CPUs, it is worthwhile to make some observations about how the Twofish algorithm will run on future processors, including Intel’s Merced. Not many details are known about Merced, other than that it includes an Explicitly Parallel Instruction Computing (EPIC) architecture, as well the ability to run existing Pentium code. EPIC is related to VLIW architectures that allow many parallel opcodes to be executed at once, while the Pentium allows only two opcodes in parallel, and the Pentium Pro/Pentium II may process up to three opcodes per clock. However, access to memory tables is limited in most VLIW implementations to only a few parallel operations, and we expect similar restrictions to hold for Merced. For example, an existing Philips VLIW CPU can process up to five opcodes in parallel, but only two of the opcodes can read from memory.

Since Twofish relies on 8-bit non-linear S-boxes, it is clear that table access is an integral part of the algorithm. Thus, Twofish might not be able to take advantage of all the parallel execution units available on a VLIW processor. However, there is still plenty of parallelism in Twofish that can be well-utilized in an optimized VLIW software implementation. Equally important, the alternative of not using large S-boxes, while it may allow greater parallelism, also naturally involves less non-linearity and thus generally requires more rounds. For example, Serpent [BAK98], based on “inline” computation of 4-bit S-boxes, may experience a relatively larger speedup than Twofish on a VLIW CPU, but Serpent also requires 32 rounds, and is considerably slower to start with.

It should also be noted that, as with most encryption algorithms, the primitive operations used in Twofish could very easily be added to a CPU instruction set to improve software performance significantly. Future mainstream CPUs may include such support for the new AES standard. However, it is also worthwhile to remember that DES has been a standard for more than twenty years, and no popular CPU has added instruction set support for it, even though DES software performance would benefit greatly from such features.

5.5 Hardware Performance

No actual logic design has been implemented for Twofish, but estimates in terms of gates for each building block have been made. As in software, there are many possible space-time tradeoffs in hardware implementations of Two-fish. Thus, it is not meaningful to give just one figure for the speed and size attributes of Twofish in hardware. Instead, we will try to outline several of the options and give estimates for speed and gate count of several different architectures.

For example, the round subkeys could be precomputed and stored in a RAM, or they could be computed on the fly. If computed on the fly, the h function logic could be time-multiplexed between subkeys and the round function to save size at a cost in speed, or the logic could be duplicated, adding gates but perhaps running twice as fast. If the subkeys were precomputed, the h function logic would be used during a key setup phase to compute the subkeys, saving gates but adding a startup time roughly equal to one block encryption time. Similarly, a single h function logic block could be time-multiplexed between computing T0 and T1, halving throughput but saving even more gates.

As another example of the possible tradeoffs, the S-boxes could be pre-computed and stored in on-chip RAMs, allowing faster operation because there is no need to ripple through several layers of key material XORs and q permutations. The addition of such RAMs (e.g., eight 256-byte RAMs) would perhaps double or triple the size of the logic, and it would also impose a significant startup time on key change to initialize the RAMs. Despite these disadvantages, such an architecture might raise the throughput by a factor of two or more (particularly for the larger key sizes), so for high-performance systems with infrequent re-keying, this option may be attractive.

The construction method specified in Section 4.3.5 for building the 8-bit permutations q0 and q1 from four 4-bit permutations was selected mainly to minimize gate count in many hardware implementations of Twofish. These permutations can be built either directly in logic gates or as full 256-byte ROMs in hardware, but such a ROM is usually several times larger than the direct logic implementation. Since each full h block in hardware (see Figure 4.2) involves six q0 blocks and six q1 blocks (for N = 128), the gate savings mount fairly quickly. The circuit delays in building q0 or q1 using logic gates are typically at least as low as those using ROMs, although this metric is certainly somewhat dependent on the particular silicon technology and circuit library available.


Previous Table of Contents Next